from IPython.display import Image
Computational science is concerned with constructing mathematical models and quantitative analysis techniques, and with using computers to analyze and solve scientific problems.
Scientists and engineers develop computer programs (application software) that model the systems being studied, and run these programs with various sets of input parameters. In some cases, these models require massive amounts of calculations (usually floating-point) and are often executed on supercomputers or distributed computing platforms.
Numerical analysis is an important underpinning for techniques used in computational science.
High Performance Computing (HPC) is the set of methods by which scientists and engineers solve complex problems using applications that require high-bandwidth, low-latency networking and high computing capabilities.
Image(filename="bd_hpc.png")
Image(filename="whyhpc1.png")
Image(filename="whyhpc2.png")
Image(filename="500eurofr_HR.jpg")
how does the previous one compare to these?
Image("photo-intel-haswell-core-4th-gen-logos.jpg")
what we want to measure:
In 1965$^{(1)}$, Intel co-founder Gordon Moore predicted (from just 3 data points!) that semiconductor density would double every 18 months. – He was right! Transistors are still shrinking at the same rate
$^{(1)}$Moore, G.E.: Cramming more components onto integrated circuits. Electronics, 38(8), April 1965.
Image("moores.jpeg")
Image("1200px-Moore's_Law_Transistor_Count_1971-2016.png")
available here
This has become a de-facto standard accepted by:
Consequences: An incredible level of integration
right now, we just assume that hardware threads, cores, and the like are different computing elements in one integrated chip
-Today, we commonly acquire chips with 1’000’000’000 (10$^9$) transistors!
Image("titan.jpg")
-GPU Name: GM200
-GPU Variant: GM200-400-A1
-Architecture: Maxwell 2.0
-Process Size: 28 nm
-Transistors: 8,000 million
-Die Size: 601 mm$^2$
But if computers double their power every 18 months, we don't need to worry about performance...
not quite
That used to be the case in the "days of the Pentium":
Image("oldays.png")
an implicit agreement:
however ...
Image("power.png")
not only does this suck up a lot of energy (and money), but it also creates a lot of heat, as $Q\propto P$
Image("sun.png")
see also: [1]S. H. Fuller and L. I. Millett, “Computing Performance: Game Over or Next Level?,” Computer, vol. 44, no. 1, pp. 31–38, Jan. 2011.
so what solution did the smart guys at HW dept. invent?
Image("power2.png")
Image("gskill-hwbot-overclock-kit-cinebench-r15.jpg")
... nitrogen cooling is cool (literally) but quite expensive
Image("natick.jpg")
this one may be rather difficult if you are not a scuba lover
and we still have to pay the electricity bill
To switch faster (increase frequency) we have to drive current through a circuit more quickly. Now, the circuit has various fixed resistances, hence (naively) we have to increase voltage (all else equal): $V=IR$
How much power does it consume?
At any time the circuit holds a charge $q=CV$ and hence performs work equal to $W=qV=CV^2$
power consumption is $Wt^{-1}=Wf=CV^2f$ (plus fixed terms), where $f$ is the operating frequency.
Consider now what happens splitting the CPU in two cores operating at half the clock speed:
Image("more_cores.png")
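The split can be sketched numerically with the dynamic-power formula derived above, $P=CV^2f$. The values below are illustrative placeholders, and the (optimistic) assumption is that voltage can scale down proportionally with frequency:

```python
# Dynamic power of a CMOS circuit: P = C * V^2 * f (fixed/leakage terms ignored).
# Illustrative values, for the scaling argument only.
C = 1.0   # effective switched capacitance (arbitrary units)
V = 1.0   # supply voltage (arbitrary units)
f = 1.0   # clock frequency (arbitrary units)

P_one_core = C * V**2 * f

# Two cores at half the clock: if voltage scales roughly with frequency,
# each core runs at V/2 and f/2.
P_two_cores = 2 * C * (V / 2)**2 * (f / 2)

# Same aggregate instruction throughput, a quarter of the power.
print(P_two_cores / P_one_core)  # -> 0.25
```

In practice voltage cannot drop that far, but the cubic dependence on frequency (through $V\propto f$) is why lowering the clock and multiplying cores wins.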
[1]S. H. Fuller and L. I. Millett, “Computing Performance: Game Over or Next Level?,” Computer, vol. 44, no. 1, pp. 31–38, Jan. 2011.
As a practical example, the E5640 Xeon (4 cores @ 2.66 GHz) has a power envelope of 95 watts while the L5630 (4 cores @ 2.13 GHz) requires only 40 watts. That's 137% more electrical power for 24% more CPU power, for CPUs that are for the most part feature-compatible. The X5677 pushes the speed up to 3.46 GHz with some more features, but that's only 60% more processing power for 225% more electrical power.
Now compare the X5560 (2.8 GHz, 4 cores, 95 watts) with the newer X5660 (2.8 GHz, 6 cores, 95 watts) and there's 50% extra computing power in the socket
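The percentages quoted above follow directly from the TDP and clock figures (using clock speed as a rough proxy for "CPU power"); a quick check:

```python
# TDP (watts) and clock (GHz) as quoted in the text above.
e5640_w, e5640_ghz = 95, 2.66
l5630_w, l5630_ghz = 40, 2.13

extra_power = (e5640_w / l5630_w - 1) * 100    # extra electrical power, percent
extra_clock = (e5640_ghz / l5630_ghz - 1) * 100  # extra clock, percent

print(int(extra_power), int(extra_clock))  # -> 137 24
```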
Image("manycores.png")
note that three out of six here come from the game industry
but what about the gentlemen's agreement with the HW dept.?
well ...
we have to cope with this new contract:
HW dept. will continue to add transistors (in lots of simple cores) ...
... and SW dept. will have to adapt (rewrite everything).
That's why we have to worry about parallel programming
Market forces, not technology, will drive technology (core counts in this case)
(Nietzsche would like this one)
Image("manycores2.png")
Fragmentation into multiple execution units and more complicated logic creates additional hindrances to performance:
why so many "walls"?
Computing units are "ancient":
Matrix/vector multiplication in assembly:
__Z6matmulv (snippet):
vmovlhps    %xmm0, %xmm3, %xmm3
vmovss      +_b(%rip), %xmm4
vinsertf128 $1, %xmm3, %ymm3, %ymm3
vinsertps   $0x10, 44+_b(%rip), %xmm7,
vmovss      48+_b(%rip), %xmm6
vinsertps   $0x10, 36+_b(%rip), %xmm1,
vmovlhps    %xmm0, %xmm2, %xmm2
vinsertps   $0x10, 60+_b(%rip), %xmm4,
vxorps      %xmm4, %xmm4, %xmm4
vinsertf128 $1, %xmm2, %ymm2, %ymm2
vinsertps   $0x10, 52+_b(%rip), %xmm6,
vmovlhps    %xmm0, %xmm1, %xmm1
vmovaps     _a(%rip), %ymm0
vinsertf128 $1, %xmm1, %ymm1, %ymm1
vpermilps   $0, %ymm0, %ymm7
vmulps      %ymm5, %ymm7, %ymm7
vaddps      %ymm4, %ymm7, %ymm7
vpermilps   $85, %ymm0, %ymm6
Even assembly is "too complicated":
CISC: Complex Instruction Set Computing
RISC: Reduced Instruction Set Computing
few:
Image("opt_areas.png")
Much of modern computational science is performed on GNU/Linux clusters where multiple processors can be utilized and many calculations can be run in parallel. The biggest, built from non-off-the-shelf hardware, are dubbed "supercomputers".
The first supercomputer in history:
Image(filename="cray.png")
In mid-to-late 1970’s CRAY-1 was the fastest computer in the world.
Clock cycle of 12.5 ns (80 MHz), with a sustained computational rate of 138 MFLOPS. Unveiled in 1976 by its inventor Seymour Roger Cray, it spawned a new class of computer called "The Supercomputer".
The tree of HPC:
Image("archs.png")
Image("mainframes.png")
Image("cosmic_cube.png")
Micros spawned the Massively Parallel Processors (MPP) era:
Image("mpps.png")
Image("mpps2.png")
Image("goddard.png")
Image("constellation.png")
The current listing of the "biggest of biggest" can be found on www.top500.org (see also www.green500.org)
Image("top500-november-2017-2-1024.jpg")
Distribution of top500 systems in the world
Image("top500-november-2017-9-1024.jpg")
Trend over time:
Image(filename="top500-november-2017-10-1024.jpg")
Bandwidth amount of data moved per unit of time. Measured in $bytes\ s^{-1}$ (hard disks) and $bit\ s^{-1}$ (sockets and nodes)
Latency minimum time needed to assign a requested resource
Performance number of 64 bit floating point operations per second
Concurrency A condition of a system in which multiple tasks are logically active at one time.
Parallelism A condition of a system in which multiple tasks are actually active at one time.
Image(filename="parallel.png")
Image("conc_vs_par.png")
Figure from "An Introduction to Concurrency in Programming Languages" by Matthew J. Sottile, Timothy G. Mattson, and Craig E. Rasmussen, 2010
An example:
Image("web_server_1.png")
The HTML server, image server, and clients (you have to plan on having many clients) all execute at the same time
The problem of one or more clients interacting with a web server not only contains concurrency, it is fundamentally concurrent. It doesn't exist as a serial problem.
Concurrent application: An application for which the problem definition is fundamentally concurrent.
Image("web_server_2.png")
Another example: Mandelbrot's set
The Mandelbrot set is an iterative map in the complex plane:
$$ z_{n+1} = z_{n}^2 + c $$
where $c$ is a constant and $z_0=0$.
Points for which the sequence does not diverge are part of the set; in practice we stop after a maximum number of iterations.
Image("out.png")
To generate the famous Mandelbrot set image, we evaluate the function mandel(c) at each point c of a grid in the complex plane.
The computation for each point is independent of all the other points ... a so-called embarrassingly parallel problem
Image("mandel2.png")
int mandel(double x0, double y0, double tr, int maxiter)
{
    double re = 0.0;
    double im = 0.0;
    double re2 = 0.0;
    double im2 = 0.0;
    int k;
    for (k = 1; k < maxiter && (re2 + im2 < tr); k++)
    {
        im = 2. * re * im + y0;
        re = re2 - im2 + x0;
        re2 = re * re;
        im2 = im * im;
    }
    return k;
}
y0 = ymin;
for (int i=0; i< height; i++)
{
    x0 = xmin;
    for (int j=0; j < width; j++)
    {
        image[i][j] = mandel(x0, y0, horiz, maxiter);
        x0 = x0 + xres;
    }
    y0 = y0 + yres;
} // close on i
Parallel application: An application composed of tasks that actually execute concurrently in order to (1) consider larger problems in fixed time or (2) complete in less time for a fixed size problem.
How much can we gain from parallelization (2048x2048 pixels, 1000 iterations and horizon at 3)?
[g.mancini@zama mandelbrot] time OMP_NUM_THREADS=1 ./mandelbrot.exe
real 0m4.406s
[g.mancini@zama mandelbrot] time OMP_NUM_THREADS=2 ./mandelbrot.exe
real 0m2.726s
[g.mancini@zama mandelbrot] time OMP_NUM_THREADS=4 ./mandelbrot.exe
real 0m2.401s
Key points:
The parallel programming process
Image("ppar_process.png")
EVERY parallel program requires a task decomposition and a data decomposition:
For the Mandelbrot set: map the pixels into row blocks and deal them out to the cores. This will give each core a memory efficient block to work on.
Image("ppar_process2.png")
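A minimal Python sketch of this row-block decomposition (the `mandel` function mirrors the C kernel above); here we only verify that independently computed blocks reassemble into the full image, while on real hardware each block would be dealt to a different core:

```python
def mandel(x0, y0, horizon=4.0, maxiter=100):
    """Iteration count until |z|^2 exceeds the horizon (mirrors the C code)."""
    re = im = re2 = im2 = 0.0
    k = 1
    while k < maxiter and re2 + im2 < horizon:
        im = 2.0 * re * im + y0
        re = re2 - im2 + x0
        re2, im2 = re * re, im * im
        k += 1
    return k

def compute_rows(rows, width, xmin, ymin, xres, yres):
    """Compute one block of image rows (the task for one core)."""
    return [[mandel(xmin + j * xres, ymin + i * yres) for j in range(width)]
            for i in rows]

width = height = 32
xmin, ymin, xres, yres = -2.0, -1.5, 3.0 / width, 3.0 / height

# Serial reference: all rows at once.
full = compute_rows(range(height), width, xmin, ymin, xres, yres)

# Split the rows into 4 blocks, compute each independently, reassemble.
nblocks = 4
blocks = [compute_rows(range(b * height // nblocks, (b + 1) * height // nblocks),
                       width, xmin, ymin, xres, yres) for b in range(nblocks)]
reassembled = [row for block in blocks for row in block]

print(reassembled == full)  # -> True
```

Because no block reads another block's data, the tasks really are independent: this is the embarrassingly parallel structure the decomposition exploits.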
How is a "real life" parallel program built?
A large fraction of HPC applications (such as Molecular Dynamics) use a message-passing notation with the Single Program Multiple Data (SPMD) design pattern.
Image("glue.png")
An easy way to go is “Bulk Synchronous Processing”.
Image("bulk.png")
Is this efficient?
How can we measure the outcome of our parallelization effort?
Let's consider the speedup that we can gain from parallelization; a simple measure would be:
$$ S = \frac{T_{ser}}{T_{par}} $$
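Plugging in the wall-clock times measured above for the Mandelbrot runs:

```python
# Wall-clock times from the OMP_NUM_THREADS runs above: threads -> seconds.
t = {1: 4.406, 2: 2.726, 4: 2.401}

# Speedup S = T_ser / T_par for each thread count.
for p in (2, 4):
    print(p, round(t[1] / t[p], 2))  # -> 2 1.62, then 4 1.84
```

Both values fall well short of the ideal $S = p$, which is the puzzle addressed next.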
Molecular Dynamics
Simulate the motion of atoms as point masses interacting by classical laws: $$ m_i \frac{d^2}{dt^2}R_i(t) = -\nabla V(R) $$ $$ V = V_{bond} + V_{angle} + V_{tors} + V_{Coul} + V_{LJ}$$
Image("PES.png")
Image("image.png")
Image("charmm_bench2.png")
the speedup is not constant with the number of processors; why?
let's get back to our Mandelbrot toy:
Image("kcache.png")
not every part of the code runs in parallel; even in the utopian situation of linear speedup of the parallel code, that serial cost stays fixed (for a fixed problem size).
unsigned char colour[3];
FILE * fp = fopen(filename,"w");
fprintf(fp,"P3\n%d %d %d\n",width, height, 255);
for (int i=0; i< height; i++)
    for (int j=0; j < width; j++)
    {
        if(image[i][j] >= maxiter)
        {
            colour[0] = 0;
            colour[1] = 0;
            colour[2] = 0;
        }
        else
        {
            colour[0] = 255;
            colour[1] = 255;
            colour[2] = 255;
        }
        fprintf(fp,"%d %d %d ",colour[0],colour[1],colour[2]);
    }
Amdahl's Law
Consider a generic program that takes time $T_{ser}$ to complete when run in serial mode, made up of an inherently serial part and a parallelizable part. Then $$ T_{ser} = f_s T_{ser} + f_p T_{ser} $$ where $f_s$ and $f_p = 1 - f_s$ are the serial and parallelizable fractions of the code
now, running on $p$ processors we have (assuming linear speedup of the parallel part):
$$ T_{par} = f_s T_{ser} + f_p \frac{T_{ser}}{p} = \left(\alpha+\frac{1-\alpha}{p}\right) T_{ser} $$
with $\alpha=f_s$. The total speedup is then:
$$ S = \frac{T_{ser}}{T_{par}} = \frac{1}{\alpha+\frac{1-\alpha}{p}}$$
which goes to $1/\alpha$ with many processors
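A quick numerical sketch of the law above, taking as an example a serial fraction of $\alpha = 0.27$ (the value profiled for our Mandelbrot code):

```python
def amdahl_speedup(alpha, p):
    """Speedup on p processors for a code with serial fraction alpha."""
    return 1.0 / (alpha + (1.0 - alpha) / p)

# With alpha = 0.27 the speedup saturates quickly:
for p in (2, 4, 16, 1024):
    print(p, round(amdahl_speedup(0.27, p), 2))
# the asymptote is 1/alpha ~ 3.7, no matter how many processors we add
```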
the maximum speedup for Mandelbrot is
1/0.27
but there's still work to do:
4.406/2.726
Image("charmm_bench3.png")
Two major sources of parallel overhead:
Load imbalance: the slowest process determines when everyone is done. Time waiting for other processes to finish is time wasted.
Communication overhead: a cost only incurred by the parallel program. It grows with the number of processes for collective communication.
Image("charmm_bench.png")
Weak scaling
However: non-bonded interactions (all vs. all) scale as $N^2$ while bonded ones scale as $N$ (an atom typically has a fixed number of neighbours). Hence, a serial (or less parallel) part may grow less than the parallel one for some problems.
$$ S(P) \to S(P,N) $$ $$ S(P,N) = \frac{T_{ser}}{T_{par}} = \frac{1}{\alpha(N)+\frac{1-\alpha(N)}{P}}$$ $$ N \to \infty \Rightarrow \alpha(N) \to 0 $$ $$ S(P,N)_{\alpha \to 0} = P $$
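A sketch of this effect with a toy cost model (the coefficients are illustrative, not measured): bonded work grows as $N$ and non-bonded work as $N^2$, so the serial fraction $\alpha(N)$ shrinks as the problem grows:

```python
def alpha_md(N, a=1.0, b=1.0):
    """Serial fraction for a cost model T = a*N (bonded) + b*N^2 (non-bonded)."""
    return a * N / (a * N + b * N**2)

def speedup(N, P):
    """Amdahl speedup with a size-dependent serial fraction."""
    al = alpha_md(N)
    return 1.0 / (al + (1.0 - al) / P)

P = 64
for N in (10, 1000, 100000):
    print(N, round(speedup(N, P), 1))
# as N grows, alpha(N) -> 0 and the speedup approaches P
```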
Image("dense_matrices.png")
Image(filename="envs90.png")
Environments from the literature 2010-2012:
Image("envs10.png")
Image("von_neumann.png")
From Wikipedia: The von Neumann architecture is a computer design model that uses a processing unit and a single separate storage structure to hold both instructions and data.
Image("cpu_layout.svg.png")
Image("onecore.png")
Instructions and data must be continuously fed to the control and arithmetic units, so that the speed of the memory interface poses a limitation on compute performance (von Neumann bottleneck).
The architecture is inherently sequential, processing a single instruction with (possibly) a single operand or a group of operands from memory. SISD (Single Instruction Single Data) has been coined for this concept.
All the components of a CPU core can operate at some maximum speed called peak performance
The performance at which the Floating Point units generate results for multiply and add operations is measured in floating-point operations per second (Flops/sec).
Typical single-core performance for the latest (e.g. Coffee Lake) Intel architectures reaches about 100 GFlop/s.
Feeding arithmetic units with operands is a complicated task. The most important data paths from the programmer’s point of view are those to and from the caches and main memory (see later). The performance, or bandwidth of those paths is quantified in GBytes/s.
Fathoming the chief performance characteristics of a processor or system is one of the purposes of low-level benchmarking such as the vector triad
double start, end, mflops;
timing(&start); // a generic timestamp function
for(k=0; k<NITER; k++)
{
    for(i=0; i<N; ++i)
        a[i] = b[i] + c[i] * d[i];
    if(a[2]<0.) dummy(a,b,c,d); // prevent optimization
}
timing(&end);
mflops = 2.0*NITER*N/(end-start)/1000000.0;
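For illustration, the same triad in Python (using `time.perf_counter` as the timestamp function); pure Python is orders of magnitude slower than the compiled loop, so only the flop accounting carries over, not the resulting number:

```python
import time

N, NITER = 10000, 100
b = [1.0] * N
c = [2.0] * N
d = [3.0] * N

start = time.perf_counter()
for _ in range(NITER):
    a = [b[i] + c[i] * d[i] for i in range(N)]  # the triad: a = b + c*d
end = time.perf_counter()

# 2 flops (one multiply, one add) per element per iteration.
mflops = 2.0 * NITER * N / (end - start) / 1e6

print(a[0], mflops > 0)  # -> 7.0 True
```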
Two dimensions:
- Frequency of CPU
- number of CPUs
we got seven:
Image("perfo_dim.png")
1. Latency
- Each instruction takes a certain time to complete.
- The amount of time between instruction issue and completion.
2. Throughput
- The number of instructions that complete in a span of time.
Assembly line: workstations perform a single production step before passing the partially completed automobile to the next workstation. When Henry Ford(1) introduced it in 1920, 31 workstations could assemble a Model T car in about 1.5 hrs.
Early platforms, such as those of the 1960s, used to have multiple parallel units (i.e. multiple workers) performing arithmetic and logic operations.
The first "dedicated workers", each performing a single task before passing data on to the next, were introduced in the Cray-1, which had a pipeline of 12 stages.
If it takes m different steps to finish the product, m products are continually worked on in different stages of completion. If all tasks are carefully tuned to take the same amount of time (the “time step”), all workers are continuously busy. At the end, one finished product per time step leaves the assembly line.
The most simple setup is a “fetch–decode–execute” pipeline, in which each stage can operate independently of the others. While an instruction is being executed, another one is being decoded and a third one is being fetched from instruction (L1I) cache.
Breaking up tasks in many different elementary stages (one of the reasons behind RISC):
Image("ILP_wall.png")
For a pipeline of depth $m$, executing $N$ independent, subsequent operations takes $N + m − 1$ steps. We can thus calculate the expected speedup versus a general-purpose unit that needs $m$ cycles to generate a single result:
$$ \frac{T_{seq}}{T_{pipe}} = \frac{mN}{N+m-1} $$
Throughput is: $$ \frac{N}{T_{pipe}} = \frac{N}{N+m-1} $$
that is, for large $N$ the speedup tends to $m$ and the throughput to one result per cycle. The critical $N_c$ needed to achieve a throughput of at least $p$ ($0\leq p \leq 1$) is: $$ N_c = \frac{p(m-1)}{1-p} $$
At $p=0.5$, $N_c=m-1$. Think about how to keep a pipeline of depth 20 to 30 filled, with possibly slow operations (e.g. transcendental functions) in the mix.
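These formulas in executable form: speedup $mN/(N+m-1)$, throughput $N/(N+m-1)$ results per cycle, and the critical $N_c = p(m-1)/(1-p)$; the depth $m=20$ is just an example of a deep pipeline:

```python
def speedup(m, N):
    """Speedup of an m-stage pipeline over a unit taking m cycles per result."""
    return m * N / (N + m - 1)

def throughput(m, N):
    """Results per cycle for N independent operations."""
    return N / (N + m - 1)

def n_critical(m, p):
    """Independent operations needed to reach a throughput of p."""
    return p * (m - 1) / (1 - p)

m = 20
print(round(speedup(m, 10), 1), round(speedup(m, 10000), 1))
print(n_critical(m, 0.5))  # -> 19.0, i.e. m - 1 operations for 50% throughput
```

Note how short bursts of work (N of order m) waste most of the pipeline: filling and draining dominate.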
Digression: maybe this was not actually invented by Ford. Do you know what this is?
Image("ars2.jpg")
Image("superscalar.png")
But ... automatically searching for independent instructions requires extra resources
Out-of-order execution. If arguments to instructions are not available in registers “on time,” e.g., because the memory subsystem is too slow to keep up with processor speed, out-of-order execution can avoid idle times (also called stalls) by executing instructions that appear later in the instruction stream but have their parameters available. This improves instruction throughput and makes it easier for compilers to arrange machine code for optimal performance. Current out-of-order designs can keep hundreds of instructions in flight at any time, using a reorder buffer that stores instructions until they become eligible for execution.
Image("ooe.png")
Image("vectors.png")
Small, fast, on-chip memories serve as temporary data storage for holding copies of data that is to be used again “soon,” or that is close to data that has recently been used. Enlarging the cache size does usually not hurt application performance, but there is some tradeoff because a big cache tends to be slower than a small one.
Also:
Image("transisto_cores.jpg")
Pipelined architectures, however performant, will inevitably stall. One way to minimize the cost of stalls is to keep units that would otherwise stall during the execution of one program busy with another one. This is called hardware threading, multithreading, or hyper-threading.
Multithreading is implemented by creating multiple sets of registers and latches to hold the state of the multiple programs. The correct register set is selected for the right pipeline resource by the hardware.
Most integer/logic instructions have a one-cycle execution latency:
Floating-point latencies are typically multi-cycle:
... too early but
if you really need to get as much as possible from this:
Image("500eurofr_HR.jpg")